This post gives an overview of my final capstone project for the IBM Data Science Professional Certificate. As prepared for the assignment, I go through the problem description, data preparation and final analysis step by step.
Som Tum (Green papaya salad) is a spicy salad made from shredded unripe papaya. Originating from ethnic Lao people, it is also eaten throughout Southeast Asia. Locally known in Thailand as Som Tum (Thai: ส้มตำ, pronounced [sôm tām]), in Laos as tam som (Lao: ຕໍາສົ້ມ), or the more specific name tam maak hoong (Lao: ຕໍາໝາກຫຸ່ງ, pronounced [tàm.ma᷆ːk.hūŋ]), in Cambodia as bok l'hong (Khmer: បុកល្ហុង, pronounced [ɓok lhoŋ]), and in Vietnam as gỏi đu đủ.
Som Tum, the Thai variation, was listed at number 46 on World's 50 most delicious foods compiled by CNN Go in 2011[1] and 2018.
This final project explores the best locations for Som Tum restaurants throughout Bangkok.
Bangkok, Thailand's capital, was named the most visited city in the world in 2018, according to MasterCard. Paris (19.10 million) came second, with London (19.01 million), Dubai (15.93 million) and Singapore (14.67 million) rounding out the top five.
Bangkok is famous worldwide. It is diverse in many ways and multicultural, and it is also the country's financial hub.
Som Tum is one of the Isaan (Northeastern Thai) dishes you should try. I want to find and enjoy Som Tum and other Isaan food, so this report explores which neighborhoods and districts of Bangkok have the most, as well as the best, Som Tum restaurants.
Additionally, I will attempt to answer the questions “Where should I open a Som Tum restaurant?” and “Where should I stay if I want great Isaan food?”
The objective of this project is to use Foursquare location data and regional clustering of venue information to determine what might be the ‘best’ neighborhood in Bangkok to open a restaurant. Som Tum and Isaan food are among the most frequently bought dishes in Bangkok.
Som Tum originates from Northeastern Thailand. With a population of over 8 million, including people who migrated from around the country, Bangkok offers numerous opportunities to open a new Som Tum restaurant. Through this project, I will find the most suitable location for an entrepreneur to open a new Som Tum and Isaan restaurant in Bangkok.
This project is aimed at entrepreneurs or business owners who want to open a new Som Tum and Isaan restaurant or grow their current business. The analysis will provide information that this audience can use when choosing a location.
The success criteria of the project will be a good recommendation for a neighborhood choice to open a new Isaan restaurant in Bangkok.
The data required is a combination of CSV files prepared for this analysis from multiple sources: the list of neighbourhoods in Bangkok (via Wikipedia) and venue data pertaining to Som Tum restaurants (via Foursquare). The venue data helps identify which neighbourhood is best suited for opening a Som Tum restaurant.
3.1 — Data acquisition

Figure 1 : Wikipedia Page showing List of Neighborhoods in Bangkok with respective Postal Codes
Source 1: List of districts of Bangkok via Wikipedia
https://en.wikipedia.org/wiki/List_of_districts_of_Bangkok
The Wikipedia page shown above provided almost all the information about the neighborhoods. It included the postal code, district, and name of each neighborhood in Bangkok.
Since the data is not in a format that is suitable for analysis, scraping of the data was done from this site.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
import requests
import json
from bs4 import BeautifulSoup
import matplotlib.cm as cm
import matplotlib.colors as colors
%matplotlib inline
print('Packages installed :)')
I used BeautifulSoup and Pandas to scrape the districts from Wikipedia and organize a table containing the District (Khet), Post-code, Latitude and Longitude information of Bangkok.
url = 'https://en.wikipedia.org/wiki/List_of_districts_of_Bangkok'
df = pd.read_html(url, header=0)
df_bkk = df[0]
df_bkk.head(12)
Dropping unnecessary data.
df_bkk.drop(['MapNr', 'Thai', 'No. ofSubdis-trictsKhwaeng'], axis=1, inplace=True)
df_bkk.head(12)
Rename some columns for easy referencing.
df_bkk.rename(columns = {"District(Khet)": "District", "Post-code": "Postcode","Popu-lation": "Population"}, inplace = True)
df_bkk.head()
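The drop and rename steps above depend on the exact header strings that pd.read_html returns, and those can change whenever the Wikipedia table is edited. As a minimal sketch, assuming a hypothetical helper named tidy_district_table and a small hand-made frame standing in for the scraped table (the values are placeholders, not real data), the same cleanup can be written so that it tolerates a missing column:

```python
import pandas as pd

# Hand-made stand-in for the scraped table; values are placeholders, not real data.
raw = pd.DataFrame({
    "District(Khet)": ["A", "B"],
    "MapNr": [1, 2],
    "Post-code": [10000, 10001],
    "Popu-lation": [100, 200],
    "Latitude": [13.70, 13.80],
    "Longitude": [100.50, 100.60],
})

def tidy_district_table(df):
    """Hypothetical helper: drop unused columns and shorten header names."""
    unused = ["MapNr", "Thai", "No. ofSubdis-trictsKhwaeng"]
    renames = {
        "District(Khet)": "District",
        "Post-code": "Postcode",
        "Popu-lation": "Population",
    }
    # errors="ignore" skips any listed column that is absent instead of raising KeyError
    return df.drop(columns=unused, errors="ignore").rename(columns=renames)

tidy = tidy_district_table(raw)
print(list(tidy.columns))
```

Returning a new frame instead of using inplace=True also makes it safe to re-run the cell without hitting an error on columns that were already dropped.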
I used the Python folium library to visualize Bangkok and its 50 districts. The map is centred on the latitude and longitude of the national highways' Kilometre Zero (the Democracy Monument), with the districts superimposed on top:
import folium
# map centre: Kilometre Zero at the Democracy Monument
lat = 13.757083
long = 100.502084
map_bkk = folium.Map(location=[lat, long], zoom_start=10)
# separate loop names keep the centre stored in lat/long from being overwritten
for d_lat, d_lng, district in zip(
        df_bkk['Latitude'],
        df_bkk['Longitude'],
        df_bkk['District']):
    label = folium.Popup(str(district), parse_html=True)
    folium.CircleMarker(
        [d_lat, d_lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='yellow',
        fill_opacity=0.2).add_to(map_bkk)
map_bkk
I used the Foursquare API, centered on each district, to explore the surrounding neighborhood within a 1,500 meter radius.
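CLIENT_ID = ''  # your Foursquare client ID (left blank here)
CLIENT_SECRET = ''  # your Foursquare client secret (left blank here)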
VERSION = '20180605'
Get the top 100 venues that are in Bangkok within a radius of 1,500 meters.
def getNearbyVenues(names, latitudes, longitudes, radius=1500, limit=100):
    # radius is in meters; limit caps the number of venues returned per request
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            limit)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    # flatten the per-district lists into one table
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
# Get venues for all neighborhoods in our dataset
bkk_venues = getNearbyVenues(names=df_bkk['District'],
                             latitudes=df_bkk['Latitude'],
                             longitudes=df_bkk['Longitude'])
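The request-and-parse step above assumes every response contains a group and every venue has at least one category. The following is only a sketch of a more defensive version that runs without network access: build_explore_url and parse_items are hypothetical helper names, and the sample payload is made up to mirror the keys the function above reads, not real API output.

```python
from urllib.parse import urlencode

def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=1500, limit=100):
    """Hypothetical helper: build the explore URL without sending a request."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes the values, so the comma in "ll" becomes %2C
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

def parse_items(district, payload):
    """Hypothetical helper: flatten one response, tolerating missing keys."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        # fall back to an empty category when the list is missing or empty
        categories = venue.get("categories") or [{}]
        rows.append((district, venue.get("name"), categories[0].get("name")))
    return rows

# Made-up payload in the shape the code above expects; not real API output.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Venue",
               "location": {"lat": 13.75, "lng": 100.50},
               "categories": [{"name": "Som Tum Restaurant"}]}},
    {"venue": {"name": "No Category Venue",
               "location": {"lat": 13.76, "lng": 100.51},
               "categories": []}},
]}]}}

rows = parse_items("Example District", sample)
url = build_explore_url(13.757083, 100.502084, "ID", "SECRET")
print(rows)
```

A venue without a category gets None instead of raising an IndexError, so one malformed entry does not stop the loop over all districts.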
Firstly, I will use exploratory data analysis (EDA) to uncover hidden properties of the data and provide useful insights to the reader, whether future traveler or investor.
4.2.1 Using Foursquare Location Data
Next, let’s make use of the Foursquare API results: the top 100 venues within a radius of 1,500 meters of each Bangkok district.
bkk_venues.head()
How many venues per neighborhood?
bkk_venues.groupby('Neighborhood').count()
How many unique venue categories are there?
print('There are {} unique categories.'.format(len(bkk_venues['Venue Category'].unique())))
Are there any Som Tum Restaurants in the venues?
"Som Tum Restaurant" in bkk_venues['Venue Category'].unique()
To analyze the data, I transformed the categorical venue data into numerical data that machine learning algorithms can use. This technique is called one-hot encoding. For each neighborhood, the encoded venues were then averaged to give the frequency of each venue category in that neighborhood.
# one hot encoding
bkk_onehot = pd.get_dummies(bkk_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
bkk_onehot['Neighborhood'] = bkk_venues['Neighborhood']
# move neighborhood column to the first column
fixed_columns = [bkk_onehot.columns[-1]] + list(bkk_onehot.columns[:-1])
bkk_onehot = bkk_onehot[fixed_columns]
bkk_onehot.head()
Next, group the rows by neighborhood and take the mean of the frequency of occurrence of each category.
bkk_grouped = bkk_onehot.groupby(["Neighborhood"]).mean().reset_index()
print(bkk_grouped.shape)
bkk_grouped.head()
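To see what the encoding and the grouped mean produce, here is a small sketch on a toy table. The districts "A" and "B" and their venues are made up for illustration; only the column names follow the code above.

```python
import pandas as pd

# Toy venue table; districts and categories are made up for illustration.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Som Tum Restaurant", "Coffee Shop", "Coffee Shop",
                       "Hotel", "Som Tum Restaurant", "Hotel"],
})

# One indicator column per category (dtype=int keeps 0/1 rather than booleans)
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column is the share of a district's venues in that category
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

In district A one of four venues is a Som Tum restaurant, so its value is 0.25. In district B it is one of two, so the value is 0.5. The numbers are shares of the sampled venues, not restaurant counts.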
K-Means Clustering
To make the analysis more interesting, I clustered the neighborhoods so that those with similar average frequencies of Som Tum restaurants fall into the same group.
To do this I used K-Means clustering. To find a K value that neither overfits nor underfits the data, I used the elbow method. In this technique, I fitted the model for a range of K values, recorded the inertia for each (the within-cluster sum of squared distances, plotted as "Error"), and then chose the best K value.
The best K value is at the point where the line turns most sharply.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=15, random_state=8)
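# som_tum is not defined earlier in this post; assumed here to be the Som Tum column of bkk_grouped
som_tum = bkk_grouped[['Neighborhood', 'Som Tum Restaurant']]
X = som_tum.drop(['Neighborhood'], axis=1)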
kmeans.fit(X)
kmeans.labels_[0:10]
def get_inertia(n_clusters):
    km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=15, random_state=8)
    km.fit(X)
    return km.inertia_
scores = [get_inertia(x) for x in range(2, 21)]
plt.figure(figsize=[10, 8])
sns.lineplot(x=range(2, 21), y=scores, color='r')
plt.title("K vs Error")
plt.xticks(range(2, 21))
plt.xlabel("K")
plt.ylabel("Error")
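To illustrate what the elbow looks like when the number of groups is known in advance, the sketch below runs the same inertia loop on synthetic data. The three blob centres, the noise scale and the name X_demo are assumptions made for this example and have nothing to do with the Bangkok data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 30 points each (assumed for the demo)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.66]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Inertia (within-cluster sum of squared distances) for K = 1..6
ks = list(range(1, 7))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
    for k in ks
]

# Reduction in inertia gained by each additional cluster
drops = -np.diff(inertias)
for k, drop in zip(ks[1:], drops):
    print("K={}: inertia reduced by {:.1f}".format(k, drop))
```

With three real groups, the reduction is large up to K = 3 and small afterwards, which is the sharp turn the plot is meant to reveal. On real data the turn is usually less clear-cut, so the chosen K remains a judgment call.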
Then I used a tool that identifies the optimum K value automatically. I imported ‘KElbowVisualizer’ from the Yellowbrick package and fitted a K-Means model to the elbow visualizer.
from yellowbrick.cluster import KElbowVisualizer
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(2,21))
visualizer.fit(X) # Fit the data to the visualizer
visualizer.show()
The optimum K value is 6, so the result will have 6 clusters.
kclusters = 6
bkk_grouped_clustering = som_tum.drop('Neighborhood', axis=1)
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(bkk_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
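One thing to keep in mind when reading the labels: K-Means assigns cluster numbers arbitrarily, so label 0 is not necessarily the cluster with the fewest Som Tum restaurants. As a sketch only, using made-up shares and three clusters instead of six, the labels can be renumbered so that they follow the cluster centres from lowest to highest:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up Som Tum shares for six districts (one feature per row)
shares = np.array([[0.00], [0.01], [0.02], [0.10], [0.11], [0.30]])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(shares)

# order[i] is the original label of the cluster with the i-th smallest centre
order = np.argsort(km.cluster_centers_.ravel())
# rank[label] is the new label: 0 for the lowest centre, 2 for the highest
rank = np.empty_like(order)
rank[order] = np.arange(len(order))

ordered_labels = rank[km.labels_]
print(ordered_labels.tolist())
```

After renumbering, a higher label always means a higher average share, which makes bar charts and map legends easier to compare across runs.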
# copy the per-district Som Tum frequencies so the cluster labels can be added
somtum_merged = som_tum.copy()
# add clustering labels
somtum_merged["Cluster Labels"] = kmeans.labels_
somtum_merged.head()
# join with bkk_venues to add latitude/longitude; this yields one row per venue, not per district
somtum_merged = somtum_merged.join(bkk_venues.set_index("Neighborhood"), on="Neighborhood")
print(somtum_merged.shape)
somtum_merged.head()
# sort the results by Cluster Labels
print(somtum_merged.shape)
somtum_merged.sort_values(["Cluster Labels"], inplace=True)
somtum_merged
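Because bkk_venues has one row per venue, the join above repeats each district once for every venue found in it. The sketch below shows the effect on made-up data and one possible alternative that keeps a single row per district. The frame names clustered, per_venue and per_district are hypothetical.

```python
import pandas as pd

# Made-up data: two clustered districts and five venues
clustered = pd.DataFrame({"Neighborhood": ["A", "B"], "Cluster Labels": [0, 1]})
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Neighborhood Latitude": [13.70, 13.70, 13.70, 13.80, 13.80],
    "Neighborhood Longitude": [100.50, 100.50, 100.50, 100.60, 100.60],
    "Venue": ["v1", "v2", "v3", "v4", "v5"],
})

# Same pattern as above: every venue row is attached to its district
per_venue = clustered.join(venues.set_index("Neighborhood"), on="Neighborhood")

# Alternative: keep one coordinate pair per district before merging
coords = venues[["Neighborhood", "Neighborhood Latitude",
                 "Neighborhood Longitude"]].drop_duplicates("Neighborhood")
per_district = clustered.merge(coords, on="Neighborhood", how="left")

print(per_venue.shape, per_district.shape)
```

Which form is appropriate depends on the question: the per-venue table suits venue-level lookups, while counts of neighborhoods per cluster only make sense on the per-district table.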
The result covers a total of 50 districts of Bangkok, together with the Som Tum restaurant locations.
We will create a new map with the neighborhood and Som Tum Restaurants.
# create map
map_clusters = folium.Map(location=[lat, long], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map (loop names differ from lat/long so the map centre is not overwritten)
markers_colors = []
for n_lat, n_lon, poi, cluster in zip(somtum_merged['Neighborhood Latitude'], somtum_merged['Neighborhood Longitude'], somtum_merged['Neighborhood'], somtum_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster))
    folium.CircleMarker(
        [n_lat, n_lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill_color=rainbow[cluster-1],
        fill_opacity=0.8).add_to(map_clusters)
map_clusters
How many Neighborhoods per Cluster?
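# kmeans.labels_ follows the row order of som_tum, so match labels to districts by name
df_bkk["Cluster Labels"] = df_bkk["District"].map(dict(zip(som_tum["Neighborhood"], kmeans.labels_)))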
df_bkk.head(20)
objects = (1,2,3,4,5,6)
y_pos = np.arange(len(objects))
# count rows per cluster label, ordered by label
perf = somtum_merged['Cluster Labels'].value_counts().sort_index().tolist()
plt.bar(y_pos, perf, align='center', alpha=0.8, color=['blue', 'YELLOW','LIME','TEAL','Red','FUCHSIA'])
plt.xticks(y_pos, objects)
plt.ylabel('No of Neighborhoods')
plt.xlabel('Cluster No.')
plt.title('How many Neighborhoods per Cluster')
plt.show()
somtum_merged.rename(columns={'Neighborhood':'District'},inplace=True)
df_new = df_bkk[['District']]
cluster1 = somtum_merged.loc[somtum_merged['Cluster Labels'] == 0]
df_cluster1 = pd.merge(df_new, cluster1, on='District')
df_cluster1
cluster2 = somtum_merged.loc[somtum_merged['Cluster Labels'] == 1]
df_cluster2 = pd.merge(df_new, cluster2, on='District')
df_cluster2
cluster3 = somtum_merged.loc[somtum_merged['Cluster Labels'] == 2]
df_cluster3 = pd.merge(df_new, cluster3, on='District')
df_cluster3
cluster4 = somtum_merged.loc[somtum_merged['Cluster Labels'] == 3]
df_cluster4 = pd.merge(df_new, cluster4, on='District')
df_cluster4
cluster5 = somtum_merged.loc[somtum_merged['Cluster Labels'] == 4]
df_cluster5 = pd.merge(df_new, cluster5, on='District')
df_cluster5
cluster6 = somtum_merged.loc[somtum_merged['Cluster Labels'] == 5]
df_cluster6 = pd.merge(df_new, cluster6, on='District')
df_cluster6
clusters_mean = [df_cluster1['Som Tum Restaurant'].mean(),df_cluster2['Som Tum Restaurant'].mean(),df_cluster3['Som Tum Restaurant'].mean(),
df_cluster4['Som Tum Restaurant'].mean(),df_cluster5['Som Tum Restaurant'].mean(),df_cluster6['Som Tum Restaurant'].mean()]
objects = (1,2,3,4,5,6)
y_pos = np.arange(len(objects))
perf = clusters_mean
plt.bar(y_pos, perf, align='center', alpha=0.8, color=['blue', 'YELLOW','LIME','TEAL','Red','FUCHSIA'])
plt.xticks(y_pos, objects)
plt.ylabel('Mean')
plt.xlabel('Cluster No.')
plt.title('Average number of Som Tum Restaurants per Cluster')
plt.show()
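The six per-cluster frames above can also be summarized in a single step. This is a sketch on made-up values, with a hypothetical frame called demo that reuses the column names from the analysis:

```python
import pandas as pd

# Made-up per-district values; one row per district
demo = pd.DataFrame({
    "District": ["A", "B", "C", "D", "E"],
    "Cluster Labels": [0, 0, 1, 1, 2],
    "Som Tum Restaurant": [0.00, 0.02, 0.10, 0.12, 0.30],
})

# Number of districts and mean Som Tum share for each cluster
summary = (
    demo.groupby("Cluster Labels")["Som Tum Restaurant"]
        .agg(districts="count", mean_share="mean")
)
print(summary)
```

The resulting table has one row per cluster, so both bar charts can be drawn from it directly, for example with summary["mean_share"].plot(kind="bar").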
Most of the Som Tum restaurants are in cluster 5, shown in red, whereas cluster 4 has little to no Som Tum restaurants.
Looking at nearby venues, the optimum place for a new Som Tum restaurant is where there are many neighborhoods but little to no Som Tum restaurants, which eliminates competition.
The second-best neighborhoods with a great opportunity are in areas such as Samphanthawong, Watthana and Khlong San.
These areas are in Cluster 4. The merged table holds 1,260 rows for this cluster with no Som Tum restaurants (rows rather than neighborhoods, since Bangkok has only 50 districts), which gives a good opportunity for opening a new restaurant.
This concludes the findings of this project, which recommends that the entrepreneur open a Som Tum restaurant in these locations with little to no competition.
Nonetheless, if the food is authentic, affordable, and tastes good, I am confident that it will have a great following anywhere.